Start up

yufan_yin_week0: 9.9. - 15.9.2020

See also my course diary: https://yufanyin.github.io/datavis-R/

and the repository: https://github.com/yufanyin/datavis-R

.1 Describe my dataset

Structure of the data

learning2019 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019.csv", stringsAsFactors = TRUE) 
str(learning2019)
## 'data.frame':    218 obs. of  17 variables:
##  $ 锘縞luster     : int  3 2 1 1 3 1 2 2 1 3 ...
##  $ unref          : num  4 2 3 2 3 ...
##  $ deep           : num  3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
##  $ orga           : num  3.33 3 4.33 3.67 2.67 ...
##  $ blocks         : num  3.33 3.67 3.67 3 3.67 ...
##  $ procrastination: num  3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
##  $ perfectionism  : num  3.67 3.33 3.33 2.67 2.33 ...
##  $ innateability  : num  1 1.5 3 1.5 2.5 2 2 1 2.5 1 ...
##  $ ktransforming  : num  4 3.67 3.67 3.33 4 ...
##  $ productivity   : num  1.25 2 1.25 2.25 2.25 2.5 3 2.25 2.25 3.75 ...
##  $ gender         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ studentstatus  : int  1 1 1 1 1 1 1 1 1 3 ...
##  $ studylength    : int  39 51 3 3 15 3 3 3 3 3 ...
##  $ writingcourse  : int  2 3 4 0 0 11 0 0 44 35 ...
##  $ monthsamel     : int  2 2 NA 0 NA 2 4 NA 3 2 ...
##  $ no             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ faculty        : int  2 8 5 9 2 6 4 4 4 9 ...
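
The first column name prints as 锘縞luster rather than cluster. That pattern is typical of a UTF-8 byte-order mark (BOM) at the start of the CSV being read as part of the first header. A minimal sketch of one way to avoid it follows; it uses a tiny temporary file written only for this illustration (not the real data), and whether the real file carries a BOM is an assumption based on the output above.

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (EF BB BF).
# The file and its contents are made up for this illustration.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("cluster,unref\n3,4\n2,2\n"), con)
close(con)

# fileEncoding = "UTF-8-BOM" asks R to drop the mark before parsing the header
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)

# If a file has already been read, renaming the first column is an alternative:
# names(learning2019)[1] <- "cluster"
```

With the header read cleanly, the column can be referred to by name (cluster) in dplyr verbs instead of by position.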

The aim of the study is to investigate the interrelationships between approaches to learning and conceptions of academic writing among international university students. Altogether 218 international students of the university participated in the study in 2018 and 2019. Students were divided into homogeneous groups (profiles) based on their Z scores on the three approaches to learning. The profiles were then compared using mean differences and ANOVA.

The data ‘learning2019’ consists of 218 observations and 17 variables. It contains the students’ scores on approaches to learning (different ways in which students process information: unreflective studying, deep approach to learning and organised studying), conceptions of academic writing (blocks, procrastination, perfectionism, innate ability, knowledge transforming and productivity), and some background information (categorical variables, e.g. gender, faculty, student status and study length).

Some of the columns are explained below. Each is the mean of 2-4 items on a 5-point Likert scale (1 = totally disagree, 5 = fully agree).

  • “unref”: relying on memorisation in the learning process, lacking the reflective approach to studying and applying the fragmented knowledge base.

  • “deep”: comprehending the intentional content, using evidence and integrating with previous knowledge.

  • “orga”: time management, study organisation, effort management and concentration.

  • “blocks”: an inability to write productively that is not caused by a lack of intellectual capacity or literary skills.

  • “procrastination”: failing to start or postponing tasks like preparing for exams and doing homework.

  • “perfectionism”: setting overly high standards, pursuing flawlessness, and evaluating one’s behavior critically.

  • “innateability”: the belief that writing is a skill which “is determined at birth” or “cannot be taught or developed”.

  • “ktransforming”: (knowledge transforming) using writing to develop knowledge and generate new ideas, as part of reflective and dialectical processes.

  • “productivity”: (sense of productivity) part of self-efficacy in writing.

.2 My previous experience in R

  • I understand the basics of data wrangling.

  • I have learned to run analyses such as clustering and classification in R, though not very proficiently.

I attended the course “Introduction to Open Data Science” (HYMY-909, 5 cr) last autumn. Here is the link to my GitHub repository:

https://github.com/yufanyin/IODS-project

and my course diary:

https://yufanyin.github.io/IODS-project/

.3 Expectations for this course

  • To learn practical data visualization skills using R and the ggplot2 library. I know little about data visualization.

  • To learn about good data visualization and avoid bad or incorrect practices.

  • To produce rich, accurate and concise visualizations using my own data. I have already found a suitable method for analysing my own data and run the analysis in SPSS. Attending this course can help me produce better visualizations, which will benefit me a lot when I submit my FIRST article at the end of this year.


Week 1 Exercises

yufan_yin_week1: 16.9. - 21.9.2020

See also my course diary: https://yufanyin.github.io/datavis-R/

(Adding this link is a habit from another course.)

Exercise 1 Calculate and print

# Create a vector named my_vector. It should have 7 numeric elements.
my_vector <- c(20, 14, 18, 14, 10, 16, 16)

# Print your vector
my_vector
## [1] 20 14 18 14 10 16 16
# Calculate the minimum, maximum, and median values of your vector
summary(my_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   14.00   16.00   15.43   17.00   20.00
# Print "The median value is XX"
median_exercise1 <- median(my_vector) # Output from functions can be saved to objects
paste("The median value is", median_exercise1) # Use the paste() function to print the object with text
## [1] "The median value is 16"

Exercise 2 Combine vectors into one data frame

# Create another vector named my_vector_2. It should have the elements of my_vector divided by 2.
my_vector_2 <- my_vector/2 # Arithmetic on a vector is applied to every element
my_vector_2
## [1] 10  7  9  7  5  8  8
# Create a vector named my_words. It should have 7 character elements.
my_words <- c("swan", "goose", "mallard", "blue_tit", "philomelos", "sparrow", "gull")

# Combine my_vector and my_words into a data frame.
df <- data.frame(my_vector, my_words)
df
##   my_vector   my_words
## 1        20       swan
## 2        14      goose
## 3        18    mallard
## 4        14   blue_tit
## 5        10 philomelos
## 6        16    sparrow
## 7        16       gull
# Show the structure of the data frame.
str(df)
## 'data.frame':    7 obs. of  2 variables:
##  $ my_vector: num  20 14 18 14 10 16 16
##  $ my_words : chr  "swan" "goose" "mallard" "blue_tit" ...

Exercise 3 Use filter() to print

library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## √ ggplot2 3.3.2     √ purrr   0.3.4
## √ tibble  3.0.3     √ dplyr   1.0.2
## √ tidyr   1.1.2     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.5.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Use the head() function to print the first 3 rows of your data frame.
head(df, 3) # the second argument, n, sets the number of rows (the default is 6)
##   my_vector my_words
## 1        20     swan
## 2        14    goose
## 3        18  mallard
# Create a new variable to the data frame which has the values of my_vector_2 (remember to save the new variable to the data frame object).
pair <- c(my_vector_2)
pair
## [1] 10  7  9  7  5  8  8
df2 <- data.frame(df,pair)
df2
##   my_vector   my_words pair
## 1        20       swan   10
## 2        14      goose    7
## 3        18    mallard    9
## 4        14   blue_tit    7
## 5        10 philomelos    5
## 6        16    sparrow    8
## 7        16       gull    8
# Use filter() to print rows of your data frame greater than the median value of my_vector.
df2 %>% filter(my_vector > median(my_vector))
##   my_vector my_words pair
## 1        20     swan   10
## 2        18  mallard    9

Week 2 Exercises

yufan_yin_week2: 23.9. - 28.9.2020

See also my course diary: https://yufanyin.github.io/datavis-R/

Exercise 1

1.1 Loading libraries and suppressing any output messages in the chunk settings

Create a new code chunk where you load the tidyverse package. In the chunk settings, suppress any output messages.
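
In R Markdown the usual way to do this is the chunk option message = FALSE in the chunk header. Two alternatives written as plain R are sketched below; both functions exist in base R and knitr respectively, but which one suits a given document is a judgment call.

```r
# Option 1: wrap the call so that the attach/conflict messages are not printed
suppressPackageStartupMessages(library(tidyverse))

# Option 2: switch messages off for all later chunks of the document
knitr::opts_chunk$set(message = FALSE)
```

Option 2 only affects chunks that are evaluated after it, so it normally goes in a setup chunk at the top of the file.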

1.2 Reading the data

The tibble df has 60 observations (rows) of variables (columns) group, gender, age, score1 and score2 (continuous scores from two tests). Each row represents one participant.

df
## # A tibble: 60 x 4
##    group gender score1 score2          
##    <int> <chr>   <dbl> <chr>           
##  1     2 F        18.7 14.7563711082321
##  2     1 M        20.1 15.1463059324341
##  3     2 F        17.4 19.0025387614538
##  4     1 M        18.7 15.5693261509451
##  5     2 F        18.5 16.7322250273729
##  6     1 999      16.9 16.4511010915052
##  7     2 M        20.4 15.1008590050657
##  8     1 F        20.3 15.191041952879 
##  9     1 F        19.4 13.9717194882152
## 10     2 M        21.2 22.6918520246433
## # ... with 50 more rows

There is something to fix in three of the variables. Explore the data and describe what needs to be corrected.

Hint: You can use e.g. str(), distinct(), and summary() to explore the data.

str(df)
## tibble [60 x 4] (S3: tbl_df/tbl/data.frame)
##  $ group : int [1:60] 2 1 2 1 2 1 2 1 1 2 ...
##  $ gender: chr [1:60] "F" "M" "F" "M" ...
##  $ score1: num [1:60] 18.7 20.1 17.4 18.7 18.5 ...
##  $ score2: chr [1:60] "14.7563711082321" "15.1463059324341" "19.0025387614538" "15.5693261509451" ...
summary(df)
##      group        gender              score1         score2         
##  Min.   :1.0   Length:60          Min.   :14.17   Length:60         
##  1st Qu.:1.0   Class :character   1st Qu.:16.85   Class :character  
##  Median :1.5   Mode  :character   Median :17.61   Mode  :character  
##  Mean   :1.5                      Mean   :17.89                     
##  3rd Qu.:2.0                      3rd Qu.:19.01                     
##  Max.   :2.0                      Max.   :21.53
distinct(df)
## # A tibble: 60 x 4
##    group gender score1 score2          
##    <int> <chr>   <dbl> <chr>           
##  1     2 F        18.7 14.7563711082321
##  2     1 M        20.1 15.1463059324341
##  3     2 F        17.4 19.0025387614538
##  4     1 M        18.7 15.5693261509451
##  5     2 F        18.5 16.7322250273729
##  6     1 999      16.9 16.4511010915052
##  7     2 M        20.4 15.1008590050657
##  8     1 F        20.3 15.191041952879 
##  9     1 F        19.4 13.9717194882152
## 10     2 M        21.2 22.6918520246433
## # ... with 50 more rows

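The dataset df consists of 60 observations and 4 variables: group, gender, score1 and score2. The output shows at least two problems: gender contains the code 999, which should be treated as missing (NA), and score2 holds numbers but is stored as character.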

Exercise 2

2.1 Tidying data

Make the corrections you described above.

df <- df %>%
  mutate(gender = na_if(gender, "999"), # recode "999" to NA (missing); gender is a character column
         score2 = as.numeric(score2))   # convert a character vector to a numeric vector

2.2 Counting observations by grouping variables

Count observations by group and gender. Arrange by the number of observations (ascending).

df %>%
  count(group, gender) %>% # count() is a combination of group_by() and tally()
  arrange(n) # ascending by number of observations; arrange(desc(n)) would sort descending
## # A tibble: 6 x 3
##   group gender     n
##   <int> <chr>  <int>
## 1     2 <NA>       1
## 2     1 <NA>       3
## 3     1 F         13
## 4     1 M         14
## 5     2 M         14
## 6     2 F         15

Exercise 3

3.1 Creating a new variable: the difference between scores

Create a new variable, score_diff, that contains the difference between score1 and score2.

df$score_diff <- df$score1 - df$score2

3.2 Computing the means: using summarise() to take multiple variables in one go

Compute the means of score1, score2, and score_diff.

Hint: Like mutate(), summarise() can take multiple variables in one go.

df %>%
  summarise(score1_mean = mean(score1), score2_mean = mean(score2), score_diff_mean = mean(score_diff))
## # A tibble: 1 x 3
##   score1_mean score2_mean score_diff_mean
##         <dbl>       <dbl>           <dbl>
## 1        17.9        16.1            1.82

3.3 Computing the means by grouping variable

Compute the means of score1, score2, and score_diff by gender.

grouped_df <- df %>%
  group_by(gender)

grouped_df %>%
  summarise(score1_mean = mean(score1), score2_mean = mean(score2), score_diff_mean = mean(score_diff))
## # A tibble: 3 x 4
##   gender score1_mean score2_mean score_diff_mean
##   <chr>        <dbl>       <dbl>           <dbl>
## 1 F             17.9        16.3            1.63
## 2 M             18.1        16.0            2.08
## 3 <NA>          16.4        15.0            1.34

Exercise 4

4.1 Creating an x-y scatter plot

Using ggplot2, create a scatter plot with score1 on the x-axis and score2 on the y-axis.

df %>%
  ggplot(aes(score1, score2)) + # x = score1, y = score2
  geom_point()

4.2 Setting colour based on grouping variable, figure width and height

Continuing with the previous plot, colour the points based on gender.

Set the output figure width to 10 and height to 6.

df %>%
  ggplot(aes(score1, score2, color = gender)) + # x = score1, y = score2
  geom_point()

Exercise 5

Note: I did this part in another rmd file named ‘index’.

see: https://github.com/yufanyin/datavis-R/blob/master/index.Rmd

5.1 Metadata section

Add the author (your name) and date into the metadata section. Create a table of contents.

5.2 Knitting

Knit your document to HTML by changing html_notebook to html_document in the metadata, and pressing Knit.

See the results in my course diary: https://yufanyin.github.io/datavis-R/


Week 3 Exercises

yufan_yin_week3: 29.9. - 5.10.2020

See also my course diary: https://yufanyin.github.io/datavis-R/

library(tidyverse)

Exercise 1 Create categorical variable and use distinct()

1.1 Reading the data

Read the data into R. It has 206 observations of 17 variables.

learning2019 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019_week3.csv", stringsAsFactors = TRUE)
learning19 <- learning2019 %>%
  mutate(studylength = as.numeric(studylength),
         writingcourse = as.numeric(writingcourse))

str(learning19)
## 'data.frame':    206 obs. of  17 variables:
##  $ 锘縩o          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ cluster        : int  3 2 1 1 3 1 2 2 1 3 ...
##  $ unref          : num  4 2 3 2 3 2.67 1 2.33 3 3.67 ...
##  $ deep           : num  3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
##  $ orga           : num  3.33 3 4.33 3.67 2.67 4 2.33 3.33 4 3.67 ...
##  $ blocks         : num  3.33 3.67 3.67 3 3.67 4 2.67 2.33 3.33 2.67 ...
##  $ procrastination: num  3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
##  $ perfectionism  : num  3.67 3.33 3.33 2.67 2.33 2.33 4 2.67 3 3.33 ...
##  $ innateability  : num  1 1.5 3 1.5 2.5 2 2 1 2.5 1 ...
##  $ ktransforming  : num  4 3.67 3.67 3.33 4 3.33 2 4.33 4 4.33 ...
##  $ productivity   : num  1.25 2 1.25 2.25 2.25 2.5 3 2.25 2.25 3.75 ...
##  $ gender         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ studentstatus  : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ studylength    : num  39 51 3 3 15 3 3 3 3 3 ...
##  $ writingcourse  : num  2 3 4 0 0 11 0 0 44 35 ...
##  $ monthsamel     : int  2 2 NA 0 NA 2 4 NA 3 2 ...
##  $ faculty        : int  2 8 5 9 2 6 4 4 4 9 ...

1.2 Creating categorical variable

For my data, studylength is more suitable as the basis for a categorical variable than age. It describes how many months students have studied at the university.

Cut the continuous variable studylength into a categorical variable studylength_group. Use ggplot2’s cutting function: cut_number() makes n groups with (approximately) equal number of observations.

Count observations by studylength group.

library(ggplot2)
learning19 %>%
  mutate(score_group_test = cut_width(studylength, 12, boundary = 0)) %>% # bins of width 12 (months), starting at 0
  count(score_group_test)
##   score_group_test   n
## 1           [0,12] 102
## 2          (12,24]  47
## 3          (24,36]  19
## 4          (36,48]  14
## 5          (48,60]  16
## 6          (60,72]   5
## 7          (72,84]   2
## 8        (168,180]   1
library(ggplot2)
learning19 %>%
  mutate(studylength_group = cut_number(studylength, 3)) %>% # each group has about 206 / 3 = 68 observations
  count(studylength_group)
##   studylength_group  n
## 1             [2,7] 71
## 2            (7,17] 67
## 3          (17,172] 68

Save the results with labels to the data.

learning19 <- learning19 %>%
  mutate(studylength_group = cut_number(studylength, 3,
                                 labels = c('-7','8-17','18-')))
learning19 %>% 
  distinct(studylength_group)
##   studylength_group
## 1               18-
## 2                -7
## 3              8-17

Exercise 2 Bar plots: geom_col()

The chunk below is supposed to produce a plot but it has some errors.

The figure should be a scatter plot of cluster (different student profiles) on the x-axis and blocks on the y-axis, with points coloured by studylength_group (3 levels). It should also have three linear regression lines, one for each of the study-length groups.

Fix the code to produce the right figure.

What happens if you use geom_jitter() instead of geom_point()?

Hint: Examine the code bit by bit: start by plotting just the scatter plot without geom_smooth(), and add the regression lines last.

learning19 %>% 
  ggplot(aes(cluster, blocks, fill = studylength_group)) + 
  geom_col(position = "dodge") + 
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

learning19 %>%
  ggplot(aes(cluster, blocks)) + 
  geom_col() +
  facet_wrap(~studylength_group)
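
The exercise text asks for a scatter plot with one regression line per group, whereas the chunks above draw bars. A minimal sketch of the scatter-plot version follows. It runs on a simulated stand-in for learning19 (the data frame demo and its values are made up), so the resulting picture says nothing about the real data.

```r
library(ggplot2)

# Simulated stand-in for learning19 (values are made up for illustration)
set.seed(1)
demo <- data.frame(
  cluster = sample(1:3, 90, replace = TRUE),
  blocks = round(runif(90, min = 1, max = 5), 2),
  studylength_group = factor(sample(c("-7", "8-17", "18-"), 90, replace = TRUE),
                             levels = c("-7", "8-17", "18-"))
)

# colour (not fill) applies to both the points and the fitted lines,
# so geom_smooth() draws one regression line per group
p <- ggplot(demo, aes(cluster, blocks, colour = studylength_group)) +
  geom_jitter(width = 0.15, height = 0) +  # spreads overlapping points sideways
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p
```

Because cluster only takes the values 1, 2 and 3, geom_point() would stack many points on top of each other; geom_jitter() adds a little random horizontal noise so that they become visible.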

Exercise 3 Bar plots: geom_col()

3.1

Calculate the mean, standard deviation (sd), and number of observations (n) of score on blocks by student profiles and study-length group. Also calculate the standard error of the mean (by using sd and n). Save these into a new data frame (or tibble) named cluster_blocks_stats.

cluster_blocks_stats <- learning19 %>%
  group_by(cluster, studylength_group, .drop = FALSE) %>% # .drop = FALSE keeps combinations that have no observations
  summarise(mean_blocks = mean(blocks),
            sd_blocks = sd(blocks),
            n = n()) %>%
  ungroup()
## `summarise()` regrouping output by 'cluster' (override with `.groups` argument)
cluster_blocks_stats
## # A tibble: 9 x 5
##   cluster studylength_group mean_blocks sd_blocks     n
##     <int> <fct>                   <dbl>     <dbl> <int>
## 1       1 -7                       2.48     0.981    31
## 2       1 8-17                     2.49     0.870    37
## 3       1 18-                      2.38     0.685    26
## 4       2 -7                       2.85     0.922    27
## 5       2 8-17                     2.53     0.775    22
## 6       2 18-                      2.59     0.936    27
## 7       3 -7                       3.44     0.906    13
## 8       3 8-17                     2.88     1.15      8
## 9       3 18-                      3.04     0.845    15
learning19 %>%
  ggplot(aes(cluster, blocks)) + 
  geom_col() +
  facet_wrap(~studylength_group)
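
The exercise also asks for the standard error of the mean, which is sd / sqrt(n), and for error bars drawn from the summary table. A minimal sketch on simulated data (demo and demo_stats are made-up stand-ins for learning19 and cluster_blocks_stats):

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for learning19 (made-up values)
set.seed(2)
demo <- data.frame(
  cluster = sample(1:3, 120, replace = TRUE),
  blocks = runif(120, min = 1, max = 5),
  studylength_group = factor(sample(c("-7", "8-17", "18-"), 120, replace = TRUE),
                             levels = c("-7", "8-17", "18-"))
)

demo_stats <- demo %>%
  group_by(cluster, studylength_group) %>%
  summarise(mean_blocks = mean(blocks),
            sd_blocks = sd(blocks),
            n = n(),
            .groups = "drop") %>%
  mutate(se_blocks = sd_blocks / sqrt(n))  # standard error of the mean

# Bars show the group means, error bars show mean +/- 1 standard error
p <- ggplot(demo_stats, aes(cluster, mean_blocks)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean_blocks - se_blocks,
                    ymax = mean_blocks + se_blocks),
                width = 0.3) +
  facet_wrap(~studylength_group)
p
```

Plotting from the summary table keeps the bar heights and the error bars tied to the same precomputed numbers.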

3.2

Using cluster_blocks_stats, plot a bar plot that has cluster on the x-axis, mean score of blocks on the y-axis, and studylength levels in subplots (facets).

Use geom_errorbar() to add error bars that represent standard errors of the mean.

learning19 %>%
  ggplot(aes(cluster, blocks)) + 
  geom_bar(stat = "summary", fun.data = "mean_se") + # bar height = mean of blocks
  stat_summary(geom = "errorbar", fun.data = "mean_se", width = 0.3) + # mean +/- 1 standard error
  facet_wrap(~studylength_group)

Exercise 4 Boxplots

4.1

Create a figure that has boxplots of cluster (x-axis) by blocks (y-axis).

Note: What does ‘Ord.factor’ mean? I do not know how to change the type of the variable cluster.

learning19 %>%
  ggplot(aes(cluster, blocks)) + 
  geom_boxplot() +
  facet_wrap(~studylength_group)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
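
The warning appears because cluster is stored as an integer, so ggplot2 treats the x-axis as continuous and cannot tell which values belong to which box. Converting cluster to a factor is one common fix. ‘Ord.factor’ is how str() labels an ordered factor, that is, a factor whose levels have a defined order. A minimal sketch on made-up data:

```r
library(ggplot2)

# Simulated stand-in for two columns of learning19 (made-up values)
set.seed(3)
demo <- data.frame(
  cluster = sample(1:3, 60, replace = TRUE),
  blocks = runif(60, min = 1, max = 5)
)

# A plain factor is enough for grouping;
# factor(..., ordered = TRUE) would create an 'Ord.factor' instead
demo$cluster <- factor(demo$cluster)
str(demo$cluster)

p <- ggplot(demo, aes(cluster, blocks)) +
  geom_boxplot()  # one box per factor level
p
```

The same conversion can be done inline, for example aes(factor(cluster), blocks), if the stored column should stay numeric.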

4.2

Group the data by cluster and add mean score of blocks by cluster to a new column mean_score. Do this with mutate() (not summarise()).

Reorder the levels of cluster based on mean_score.

Hint: Remember to ungroup after creating the mean_score variable.

Note: Maybe the variable types in my data are not suitable for these operations.
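
For reference, a minimal sketch of what the exercise describes, assuming cluster has first been converted to a factor. The data are simulated, and fct_reorder() comes from the forcats package (part of the tidyverse).

```r
library(dplyr)
library(forcats)

# Simulated stand-in for learning19 (made-up values)
set.seed(4)
demo <- data.frame(
  cluster = factor(sample(1:3, 60, replace = TRUE)),
  blocks = runif(60, min = 1, max = 5)
)

demo <- demo %>%
  group_by(cluster) %>%
  mutate(mean_score = mean(blocks)) %>%  # same value repeated within each cluster
  ungroup() %>%                          # ungroup before reordering the levels
  mutate(cluster = fct_reorder(cluster, mean_score))  # levels sorted by mean_score

levels(demo$cluster)
```

mutate() keeps every row and adds the group mean as a new column, whereas summarise() would collapse each cluster to a single row.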

Exercise 5

Using the data you modified in exercise 4.2, plot mean scores (x-axis) by cluster (y-axis) as points. The clusters should be ordered by mean score.

Use stat_summary() to add error bars that represent standard errors of the mean.

Hint: Be careful which variable - mean_score or score - you’re plotting in each of the geoms.

Note: Maybe the variables in my data are not suitable for such operations.
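
A possible shape for this plot, again on simulated data and with cluster as a factor (both are assumptions, not the original course solution). The points use the precomputed mean_score, while stat_summary() needs the raw scores to work out the standard error.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Simulated stand-in for learning19 (made-up values)
set.seed(5)
demo <- data.frame(
  cluster = factor(sample(1:3, 60, replace = TRUE)),
  blocks = runif(60, min = 1, max = 5)
) %>%
  group_by(cluster) %>%
  mutate(mean_score = mean(blocks)) %>%
  ungroup() %>%
  mutate(cluster = fct_reorder(cluster, mean_score))

p <- ggplot(demo, aes(y = cluster)) +
  geom_point(aes(x = mean_score)) +   # precomputed mean per cluster
  stat_summary(aes(x = blocks),       # raw scores, summarised to mean +/- SE
               fun.data = "mean_se", geom = "errorbar",
               orientation = "y", width = 0.2)
p
```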


Week 4 Exercises

yufan_yin_week4: 6.10. - 12.10.2020

See also my course diary: https://yufanyin.github.io/datavis-R/

Exercise 1 Histograms and density plots

1.1 Reading the data

Read the region_scores.csv data

region_scores <- read.csv(file = "D:/Users/yinyf/datavis-R/week4/region_scores.csv", stringsAsFactors = TRUE)
region_scores <- region_scores %>%
  mutate(id = as.character(id),
         region = factor(region),
         education = factor(education, ordered = TRUE),
         gender = factor(gender))

glimpse(region_scores)
## Rows: 240
## Columns: 6
## $ id        <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", ...
## $ region    <fct> South Karelia, Satakunta, Kymenlaakso, South Karelia, Sou...
## $ education <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ gender    <fct> M, F, M, F, F, F, M, F, M, M, F, M, F, M, M, M, F, M, F, ...
## $ age       <int> 56, 41, 48, 41, 35, 60, 28, 28, 48, 51, 45, 55, 41, 24, 6...
## $ score     <dbl> 4.268811, 5.646586, 6.949019, 7.096777, 6.990985, 5.26766...

Cutting values (score) into intervals

to groups of width 10

region_scores %>%
  mutate(score_group = cut_width(score, 10, boundary = 0)) %>% 
  count(score_group)
##   score_group   n
## 1      [0,10]  55
## 2     (10,20] 154
## 3     (20,30]  31
region_scores <- region_scores %>%
  mutate(score_group = cut_width(score, 10, boundary = 0, 
                                 labels = c('-10','11-20','21-'))) 
region_scores %>% 
  distinct(score_group)
##   score_group
## 1         -10
## 2       11-20
## 3         21-

Column score_group is not found.

region_scores2 <- region_scores %>%
  group_by(education, score_group, .drop = FALSE) %>%
  summarise(mean_age = mean(age),
            sd_age = sd(age),
            n = n()) %>%
  ungroup()
## `summarise()` regrouping output by 'education' (override with `.groups` argument)
region_scores2
## # A tibble: 9 x 5
##   education score_group mean_age sd_age     n
##   <fct>     <fct>          <dbl>  <dbl> <int>
## 1 1         -10             39.5  10.1     46
## 2 1         11-20           38.8  10.2     39
## 3 1         21-            NaN    NA        0
## 4 2         -10             45     9.27     9
## 5 2         11-20           42.2   9.61    65
## 6 2         21-             39.3   7.57     3
## 7 3         -10            NaN    NA        0
## 8 3         11-20           40.1  10.4     50
## 9 3         21-             37.4   8.97    28

1.2 Histograms

Create a figure that shows the distributions (density plots or histograms) of age and score in separate subplots (facets). What do you need to do first?

Note: I’m not sure which grouping variable to use to create the subplots.

In the figure, set individual x-axis limits for age and score by modifying the scales parameter within facet_wrap().

Question: What went wrong when I used facet_wrap() and got the message ‘Layer 1 is missing score_group’ (or another grouping variable)? I met the same problem last week, even though I had saved score_group to the data.

region_scores %>%
  ggplot(aes(age, fill = score_group)) + 
  geom_histogram(position = "identity", alpha = .5, binwidth = 1) 

(More attempts, kept as a reminder for the future.)

region_scores %>%
  ggplot(aes(age, fill = gender)) + 
  geom_histogram(position = "identity", alpha = .5, binwidth = 1) 

region_scores %>%
  ggplot(aes(score, fill = gender)) + 
  geom_histogram(position = "identity", alpha = .5, binwidth = 1) 
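
To show age and score in separate facets, one approach is to reshape the two columns into long format first, so that the variable name itself becomes the facetting variable. A minimal sketch on simulated data (demo is a made-up stand-in for region_scores):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Simulated stand-in for region_scores (made-up values)
set.seed(6)
demo <- data.frame(
  age = sample(20:65, 100, replace = TRUE),
  score = runif(100, min = 0, max = 30)
)

# One row per (observation, variable): var holds "age" or "score"
demo_long <- demo %>%
  pivot_longer(c(age, score), names_to = "var", values_to = "value")

p <- ggplot(demo_long, aes(value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~var, scales = "free_x")  # each facet gets its own x-axis limits
p
```

A facetting variable has to be a column of the data passed to ggplot(), which is why the reshaping step comes first.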

1.3 Density plots

Note: I do not understand the meaning of the y-axis in such density plots.

region_scores %>%
  ggplot(aes(age, fill = gender)) + 
  geom_density(alpha = .5) 

region_scores %>%
  ggplot(aes(score, fill = gender)) + 
  geom_density(alpha = .5) 
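
On the y-axis question: in a density plot the height is scaled so that the total area under the curve is 1, and the area between two x values approximates the proportion of observations in that range. A small check with base R's density(), which geom_density() uses by default; the data are made up.

```r
set.seed(7)
x <- rnorm(500, mean = 40, sd = 10)  # made-up ages

d <- density(x)  # kernel density estimate: d$x is the grid, d$y the heights

# Trapezoid rule: approximate area under the estimated curve
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area  # should be very close to 1
```

This is why the y values can look small for variables with a wide range such as age: the curve is spread over many units of x.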

Exercise 2 Gather: wide-to-long, spread: long-to-wide and scatter plot

In this exercise, you will use the built-in iris dataset.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

2.1 Make data into long format

Make the data into long format: gather all variables except species into new variables var (variable names) and measure (numerical values). You should end up with 600 rows and 3 columns (Species, var, and measure). Assign the result into iris_long.

iris_long <- iris %>%
  gather(var, measure, -Species) 
str(iris_long)
## 'data.frame':    600 obs. of  3 variables:
##  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ var    : chr  "Sepal.Length" "Sepal.Length" "Sepal.Length" "Sepal.Length" ...
##  $ measure: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

2.2 Spread: long-to-wide

In iris_long, separate var into two variables: part (Sepal/Petal values) and dim (Length/Width).

Then, spread the measurement values to new columns that get their names from dim. You must create row numbers by dim group before doing this.

You should now have 300 rows of variables Species, part, Length and Width (and row numbers). Assign the result into iris_wide.

Note: This was a bit more complex than the example. I tried many times but failed, so I kept part of the code in the following chunk.

iris_long %>%
  group_by(Species) %>%
  mutate(row = row_number()) %>%
  ungroup %>%
  spread(?, ?) %>%
  select(-row)

However, this gives an error:

Must extract column with a single valid subscript. x Subscript `var` has the wrong type `data.frame<Sepal.Width:double>`. i It must be numeric or character.

Or:

iris_long %>%
  pivot_wider(names_from = c(var),
  values_from = measure) 
## Warning: Values are not uniquely identified; output will contain list-cols.
## * Use `values_fn = list` to suppress this warning.
## * Use `values_fn = length` to identify where the duplicates arise
## * Use `values_fn = {summary_fun}` to summarise duplicates
## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>      <list>       <list>      <list>       <list>     
## 1 setosa     <dbl [50]>   <dbl [50]>  <dbl [50]>   <dbl [50]> 
## 2 versicolor <dbl [50]>   <dbl [50]>  <dbl [50]>   <dbl [50]> 
## 3 virginica  <dbl [50]>   <dbl [50]>  <dbl [50]>   <dbl [50]>

This still does not work: the values are not uniquely identified, so each cell becomes a list-column.
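
One sequence that follows the steps in the exercise text is sketched below: separate var into part and dim, number the rows within each dim group, then spread by dim. It is a sketch of one possible solution, not necessarily the one intended by the course.

```r
library(dplyr)
library(tidyr)

iris_long <- iris %>%
  gather(var, measure, -Species)

iris_wide <- iris_long %>%
  separate(var, into = c("part", "dim"), sep = "\\.") %>%  # "Sepal.Length" -> "Sepal", "Length"
  group_by(dim) %>%
  mutate(row = row_number()) %>%  # row id within each dim, so Length and Width can be paired
  ungroup() %>%
  spread(dim, measure) %>%        # new columns Length and Width
  select(-row)

dim(iris_wide)
names(iris_wide)
```

The row number is what makes each combination unique. Without it, spreading has 50 values per cell and no way to pair them, which is the cause of the list-columns in the pivot_wider() attempt.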

2.3 Scatter plot

Using iris_wide, plot a scatter plot of length on the x-axis and width on the y-axis. Colour the points by part.

iris_wide %>%
  ggplot(aes(Length, Width, color = part)) + # x = Length, y = Width, colour = part
  geom_point()

Exercise 3 Read and summarize my own data

3.1 Reading my own data

Import your data into R. Check that you have the correct number of rows and columns, column names are in place, the encoding of characters looks OK, etc.

learning2019_w4 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019_week4.csv", stringsAsFactors = TRUE)

3.2

Print the structure/glimpse/summary of the data. Outline briefly what kind of variables you have and if there are any missing or abnormal values. Make sure that each variable has the right class (numeric/character/factor etc).

learning_w4 <- learning2019_w4 %>%
  mutate(studylength = as.numeric(studylength),
         writingcourse = as.numeric(writingcourse))
str(learning_w4)
## 'data.frame':    206 obs. of  10 variables:
##  $ 锘縞luster     : int  3 2 1 1 3 1 2 2 1 3 ...
##  $ unref          : num  4 2 3 2 3 2.67 1 2.33 3 3.67 ...
##  $ deep           : num  3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
##  $ orga           : num  3.33 3 4.33 3.67 2.67 4 2.33 3.33 4 3.67 ...
##  $ blocks         : num  3.33 3.67 3.67 3 3.67 4 2.67 2.33 3.33 2.67 ...
##  $ procrastination: num  3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
##  $ gender         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ studentstatus  : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ studylength    : num  39 51 3 3 15 3 3 3 3 3 ...
##  $ writingcourse  : num  2 3 4 0 0 11 0 0 44 35 ...

Exercise 4 Counting observations by grouping variables

Pick a few (2-5) variables of interest from your data (ideally, both categorical and numerical).

For categorical variables, count the observations in each category (or combination of categories). Are the frequencies balanced?

learning19_w4 %>%
  count(cluster, gender) %>%
  arrange(desc(n)) %>%
  arrange(cluster)

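Error: Must group by variables found in .data. * Column cluster is not found. Using learning19_w4[1] in its place did not work either. The str() output above suggests why: the first column is named 锘縞luster rather than cluster, which is what a byte-order mark at the start of the file looks like when it is read as part of the header. Well… I’m not very angry.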

For numerical variables, compute some summary statistics (e.g. min, max, mean, median, SD) over the whole dataset or for subgroups. What can you say about the distributions of these variables, or possible group-wise differences?

Overall:

summary(learning_w4)
##    锘縞luster        unref            deep            orga      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.670   1st Qu.:3.750   1st Qu.:2.670  
##  Median :2.000   Median :2.000   Median :4.000   Median :3.330  
##  Mean   :1.718   Mean   :2.178   Mean   :4.007   Mean   :3.411  
##  3rd Qu.:2.000   3rd Qu.:2.670   3rd Qu.:4.500   3rd Qu.:4.000  
##  Max.   :3.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      blocks      procrastination     gender      studentstatus  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.500   1st Qu.:1.000   1st Qu.:2.000  
##  Median :2.670   Median :3.250   Median :2.000   Median :2.000  
##  Mean   :2.655   Mean   :3.212   Mean   :1.714   Mean   :1.767  
##  3rd Qu.:3.330   3rd Qu.:3.750   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :5.000   Max.   :5.000   Max.   :2.000   Max.   :2.000  
##   studylength     writingcourse   
##  Min.   :  2.00   Min.   : 0.000  
##  1st Qu.:  5.00   1st Qu.: 0.000  
##  Median : 14.00   Median : 3.000  
##  Mean   : 19.75   Mean   : 6.694  
##  3rd Qu.: 28.00   3rd Qu.: 6.000  
##  Max.   :172.00   Max.   :91.000

For subgroups:

Note: I do not believe that the mean values of the subgroups divided by gender or student status (Bachelor/Master) could be equal. What’s wrong?

grouped_df <- learning_w4 %>%
  group_by(studentstatus)

grouped_df %>%
  summarise(unref_mean = mean(learning_w4$unref), deep_mean = mean(learning_w4$deep), orga_mean = mean(learning_w4$deep))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
##   studentstatus unref_mean deep_mean orga_mean
##           <int>      <dbl>     <dbl>     <dbl>
## 1             1       2.18      4.01      4.01
## 2             2       2.18      4.01      4.01

We can see that studylength (how many months students have studied at the university) is a better grouping variable than writingcourse (the number of writing courses). But …

Try cluster (student profile based on the combination of scores on ‘unref’, ‘deep’ and ‘orga’)

learning_w4 %>%
  count(learning_w4[1])
##   cluster  n
## 1       1 94
## 2       2 76
## 3       3 36
grouped_learning <- learning_w4 %>%
  group_by(learning_w4[1])

grouped_learning %>%
  summarise(unref_mean = mean(grouped_learning$unref), deep_mean = mean(grouped_learning$deep), orga_mean = mean(grouped_learning$orga))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 4
##   cluster unref_mean deep_mean orga_mean
##     <int>      <dbl>     <dbl>     <dbl>
## 1       1       2.18      4.01      3.41
## 2       2       2.18      4.01      3.41
## 3       3       2.18      4.01      3.41
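# The results look strange: every group gets the same means, because grouped_learning$unref always refers to the whole column. Writing mean(unref) inside summarise() would give the mean per group.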
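
A minimal sketch of the difference, using a small made-up data frame (toy, status and score are illustrative assumptions, not the study data). Inside summarise(), a bare column name is evaluated within each group, whereas toy$score always refers to the whole column and so repeats the overall mean in every row.

```r
library(dplyr)

# Hypothetical toy data: two groups with clearly different scores
toy <- data.frame(
  status = c(1, 1, 2, 2),
  score  = c(1, 2, 4, 5)
)

# Wrong: toy$score ignores the grouping, so each group gets the overall mean
wrong <- toy %>%
  group_by(status) %>%
  summarise(score_mean = mean(toy$score))

# Right: the bare column name is evaluated within each group
right <- toy %>%
  group_by(status) %>%
  summarise(score_mean = mean(score))

wrong$score_mean  # identical values for both groups
right$score_mean  # one mean per group
```

A column with an awkward name can be renamed first (for example names(toy)[1] <- "status"), so that it can be used as a bare name in group_by() as well.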

Exercise 5 Visualise my own data

Describe if there’s anything else you think should be done as “pre-processing” steps (e.g. recoding/grouping values, renaming variables, removing variables or mutating new ones, reshaping the data to long format, merging data frames together).

Do you have an idea of what kind of relationships in your data you would like to visualise and for which variables? For example, would you like to depict variable distributions, the structure of multilevel data, summary statistics (e.g. means), or include model fits or predictions?

5.1 Reading the data

Structure of the data

learning2019 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019_w4.csv", stringsAsFactors = TRUE, fileEncoding = "UTF-8-BOM") # "UTF-8-BOM" drops the byte order mark that garbled the first column name
learning19 <- learning2019[1:13]
str(learning19)
## 'data.frame':    211 obs. of  13 variables:
##  $ cluster        : int  3 2 1 1 3 1 2 2 1 3 ...
##  $ unref          : num  4 2 3 2 3 ...
##  $ deep           : num  3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
##  $ orga           : num  3.33 3 4.33 3.67 2.67 ...
##  $ blocks         : num  3.33 3.67 3.67 3 3.67 ...
##  $ procrastination: num  3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
##  $ perfectionism  : num  3.67 3.33 3.33 2.67 2.33 ...
##  $ innateability  : num  1 1.5 3 1.5 2.5 2 2 1 2.5 1 ...
##  $ ktransforming  : num  4 3.67 3.67 3.33 4 ...
##  $ productivity   : num  1.25 2 1.25 2.25 2.25 2.5 3 2.25 2.25 3.75 ...
##  $ gender         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ studentstatus  : int  1 1 1 1 1 1 1 1 1 3 ...
##  $ studylength    : int  39 51 3 3 15 3 3 3 3 3 ...

The aim of the study is to investigate the interrelationships between the approaches to learning and conceptions of academic writing among international university students. Altogether 218 international students of the university participated in the study in 2018 and 2019. Students were divided into homogeneous groups based on their Z scores on the three approaches to learning. Then we compare mean differences and ANOVA results between the profiles.

The original data ‘learning2019’ consists of 218 observations and 17 variables; the subset ‘learning19’ used here keeps 211 observations and the first 13 variables. The data contain the students’ scores on approaches to learning (different ways that students process information: unreflective studying, deep approach to learning and organised studying), conceptions of academic writing (blocks, procrastination, perfectionism, innate ability, knowledge transforming and productivity), and some background information (categorical variables, e.g. gender, age, faculty, student status and study length).

The columns are explained below. Each score is the average of 2-4 questions on a 5-point Likert scale (1 = totally disagree, 5 = fully agree).

  • “unref”: relying on memorisation in the learning process, lacking the reflective approach to studying and applying the fragmented knowledge base.

  • “deep”: comprehending the intentional content, using evidence and integrating with previous knowledge.

  • “orga”: time management, study organisation, effort management and concentration.

  • “blocks”: the inability to write productively whose reason is not intellectual capacity or literary skills.

  • “procrastination”: failing to start or postponing tasks like preparing for exams and doing homework.

  • “perfectionism”: setting overly high standards, pursuing flawlessness, and evaluating one’s behavior critically.

  • “innateability”: writing is a skill which “is determined at birth” or “cannot be taught or developed”.

  • “ktransforming”: (knowledge transforming) using writing for developing knowledge and generating new ideas and in the reflective and dialectic processes.

  • “productivity”: (sense of productivity) part of self-efficacy in writing.

5.2 Exploring the data numerically and graphically

5.2.1 Summaries of the variables

summary(learning19)
##     cluster          unref            deep            orga      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.667   1st Qu.:3.750   1st Qu.:2.667  
##  Median :2.000   Median :2.000   Median :4.000   Median :3.333  
##  Mean   :1.716   Mean   :2.171   Mean   :4.007   Mean   :3.414  
##  3rd Qu.:2.000   3rd Qu.:2.667   3rd Qu.:4.500   3rd Qu.:4.000  
##  Max.   :3.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      blocks      procrastination perfectionism   innateability  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.500   1st Qu.:2.000   1st Qu.:1.000  
##  Median :2.667   Median :3.250   Median :2.333   Median :1.500  
##  Mean   :2.662   Mean   :3.219   Mean   :2.556   Mean   :1.761  
##  3rd Qu.:3.333   3rd Qu.:3.875   3rd Qu.:3.333   3rd Qu.:2.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  ktransforming    productivity       gender      studentstatus  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.667   1st Qu.:1.875   1st Qu.:1.000   1st Qu.:3.000  
##  Median :4.000   Median :2.500   Median :2.000   Median :4.000  
##  Mean   :4.041   Mean   :2.487   Mean   :1.716   Mean   :3.185  
##  3rd Qu.:4.667   3rd Qu.:3.250   3rd Qu.:2.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :4.750   Max.   :2.000   Max.   :4.000  
##   studylength    
##  Min.   :  2.00  
##  1st Qu.:  5.00  
##  Median : 14.00  
##  Mean   : 21.63  
##  3rd Qu.: 28.50  
##  Max.   :172.00

5.2.2 Relationships between the variables

Calculate and print the correlation matrix

cor_matrix<-cor(learning19[2:10]) %>% round(digits = 2)
cor_matrix
##                 unref  deep  orga blocks procrastination perfectionism
## unref            1.00 -0.48 -0.31   0.33            0.25          0.28
## deep            -0.48  1.00  0.32  -0.27           -0.18         -0.19
## orga            -0.31  0.32  1.00  -0.22           -0.38         -0.14
## blocks           0.33 -0.27 -0.22   1.00            0.55          0.54
## procrastination  0.25 -0.18 -0.38   0.55            1.00          0.35
## perfectionism    0.28 -0.19 -0.14   0.54            0.35          1.00
## innateability    0.16 -0.11 -0.02   0.24            0.13          0.28
## ktransforming   -0.16  0.31  0.16  -0.30           -0.21         -0.25
## productivity    -0.15  0.16  0.30  -0.38           -0.46         -0.22
##                 innateability ktransforming productivity
## unref                    0.16         -0.16        -0.15
## deep                    -0.11          0.31         0.16
## orga                    -0.02          0.16         0.30
## blocks                   0.24         -0.30        -0.38
## procrastination          0.13         -0.21        -0.46
## perfectionism            0.28         -0.25        -0.22
## innateability            1.00         -0.25         0.01
## ktransforming           -0.25          1.00         0.21
## productivity             0.01          0.21         1.00

Mark the correlations by significance level and visualise the correlation matrix

library(corrplot)
## corrplot 0.84 loaded
p.mat <- cor.mtest(learning19[2:10])$p # test the raw data; the correlation matrix itself is not a sample
corrplot(cor_matrix, method="circle", type="upper",  tl.cex = 0.6, p.mat = p.mat, sig.level = 0.01, title="Correlations of learning19", mar=c(0,0,1,0))
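
The p-values for a correlation plot have to be computed from the raw observations, not from the rounded correlation matrix. As far as I understand, cor.mtest() does this for every pair of columns. A minimal base R sketch of the same idea, using the built-in mtcars data as a stand-in (the data set and the names vars, cor_mat and p_mat are assumptions for illustration only):

```r
# Built-in example data (an assumption for illustration; not the study data)
vars <- mtcars[, c("mpg", "hp", "wt")]

# Correlation matrix, rounded for display
cor_mat <- round(cor(vars), 2)

# p-value for every pair of variables, tested on the raw observations
p_mat <- sapply(vars, function(x) {
  sapply(vars, function(y) cor.test(x, y)$p.value)
})

cor_mat
p_mat
```

Both matrices have the same row and column names, so p_mat can be lined up with cor_mat cell by cell.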

5.2.3 Creating an x-y scatter plot

learning19 %>%
  ggplot(aes(orga, procrastination, color = cluster)) + # x = orga, y = procrastination
  geom_point()

5.3 K-means clustering

5.3.1 Calculate the distances

Euclidean distance matrix

learning19_eu <- dist(learning19[2:4])
summary(learning19_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.083   1.601   1.741   2.192   6.741

5.3.2 Determine the k

set.seed(123)
k_max <- 5 # determine the number of clusters
twcss <- sapply(1:k_max, function(k){kmeans(learning19[2:4], k)$tot.withinss}) # calculate the total within sum of squares
qplot(x = 1:k_max, y = twcss, geom = 'line') # visualize the results

The twcss value decreases sharply from 2 to 5 clusters. The optimal number of clusters was 3.
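
A self-contained sketch of the same elbow check on the built-in iris data (the data set, the scaling step and nstart = 20 are my assumptions, not part of the analysis above). The total within-cluster sum of squares is computed for each k, and the bend in the curve suggests a reasonable number of clusters.

```r
set.seed(123)            # k-means starts from random centres
x <- scale(iris[, 1:4])  # built-in data, standardised
k_max <- 5

# Total within-cluster sum of squares for k = 1, ..., k_max
twcss <- sapply(1:k_max, function(k) {
  kmeans(x, centers = k, nstart = 20)$tot.withinss
})

plot(1:k_max, twcss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Using nstart greater than 1 reruns k-means from several random starts and keeps the best solution, which makes the curve less dependent on the seed.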

5.3.3 Perform k-means clustering

learning19_km <- kmeans(learning19[2:10], centers = 3)

Plot the dataset with clusters

pairs(learning19[2:10], col = learning19_km$cluster)

pairs(learning19[,2:4], col = learning19_km$cluster)

pairs(learning19[,5:10], col = learning19_km$cluster)

The optimal number of clusters was 3. We got the best overview with three clusters.

5.3.4 Perform k-means on the original data

# library(devtools)
# library(flipMultivariates) # could not be installed, see the warning below
learning19_scaled3 <- scale(learning19[2:4])
learning19_km3 <-kmeans(learning19_scaled3, centers = 3)
cluster <- learning19_km3$cluster
learning19_scaled3 <- data.frame(learning19_scaled3, cluster)
lda.fit_cluster <- MASS::lda(cluster ~ ., data = learning19_scaled3) # lda() comes from MASS; the prefix avoids masking dplyr::select()
lda.fit_cluster

Warning in install.packages : package ‘flipMultivariates’ is not available

The package could not be installed this time, but the code ran before, so I kept it.

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes3 <- as.numeric(learning19_scaled3$cluster)
plot(lda.fit_cluster, dimen = 2, col = classes3, pch = classes3, main = "LDA biplot using three clusters")
lda.arrows(lda.fit_cluster, myscale = 2)

5.3.5 3D plot

model_predictors <- dplyr::select(learning19_train, -deep2)
# check the dimensions
dim(model_predictors)
dim(lda.fit$scaling)
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

Next, install and access the plotly package.

Create a 3D plot of the columns of the matrix product.

library(plotly)
plot_ly (x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = learning19_train$deep2)
library(plot3D)

scatter3D(x = learning19$unref, y = learning19$deep, z = learning19$orga, col = NULL, 
          main = "learning19 data", xlab = "unref",
          ylab ="deep", zlab = "orga")

library(plotly)
plot_ly (x = learning19$unref, y = learning19$deep, z = learning19$orga, type= 'scatter3d', mode='markers', color = learning19$deep)

Week 5 Exercises

yufan_yin_week5: 13.10. - 19.10.2020

Read the file timeuse_tidy.rds with readRDS(). The file contains the dataset that we tidied in the exercise session: records of daily time use from participants over multiple days. Note that since the data has been stored as rds (R-specific format), column types and factor levels are as we left them, and don’t need to be re-corrected.

readRDS(file = "D:/Users/yinyf/datavis-R/week5/timeuse_tidy.rds")
## # A tibble: 26,568 x 9
##    indivID date       female   age occ_full_time activity_class time_spent
##    <chr>   <date>     <fct>  <dbl> <fct>         <fct>               <dbl>
##  1 1013302 2016-11-11 0         70 0             Lifts                   0
##  2 1013302 2016-11-11 0         70 0             Work                    0
##  3 1013302 2016-11-11 0         70 0             Education               0
##  4 1013302 2016-11-11 0         70 0             Shopping                0
##  5 1013302 2016-11-11 0         70 0             Business                0
##  6 1013302 2016-11-11 0         70 0             Petrol                  0
##  7 1013302 2016-11-11 0         70 0             Social / Leis~          0
##  8 1013302 2016-11-11 0         70 0             Vacation                0
##  9 1013302 2016-11-11 0         70 0             Exercise                6
## 10 1013302 2016-11-11 0         70 0             Home                 1424
## # ... with 26,558 more rows, and 2 more variables: weekday <ord>,
## #   week_number <dbl>
df <- readRDS(file = "D:/Users/yinyf/datavis-R/week5/timeuse_tidy.rds")
summary(df)
##    indivID               date            female         age       
##  Length:26568       Min.   :2016-10-14   0:12024   Min.   :21.00  
##  Class :character   1st Qu.:2016-11-14   1:14544   1st Qu.:34.50  
##  Mode  :character   Median :2016-12-02             Median :34.50  
##                     Mean   :2016-11-26             Mean   :40.51  
##                     3rd Qu.:2016-12-11             3rd Qu.:54.50  
##                     Max.   :2016-12-27             Max.   :80.00  
##                                                                   
##  occ_full_time   activity_class    time_spent   weekday     week_number   
##  0: 8460       Business : 2214   Min.   :   0   Mon:3828   Min.   :42.00  
##  1:18108       Education: 2214   1st Qu.:   0   Tue:3912   1st Qu.:46.00  
##                Exercise : 2214   Median :   0   Wed:3708   Median :49.00  
##                Home     : 2214   Mean   : 120   Thu:3276   Mean   :47.83  
##                Lifts    : 2214   3rd Qu.:  19   Fri:3528   3rd Qu.:50.00  
##                Petrol   : 2214   Max.   :1440   Sat:3876   Max.   :52.00  
##                (Other)  :13284                  Sun:4440

Exercise 1

1.1

1.1.1 Create a new variable that contains combined activity classes

Create a new variable that contains combined activity classes: “Work or school” (Work, Business, Education), “Free time” (Shopping, Social / Leisure, Home, Vacation), and “Other”.

df <- df %>%
  mutate(activity_class = as.character(activity_class))
df_wide <- df %>%
  group_by(activity_class) %>%
  mutate(row = row_number()) %>%
  ungroup %>%
  spread(activity_class, time_spent) %>%
  select(-row) #long to wide

head(df_wide)
## # A tibble: 6 x 19
##   indivID date       female   age occ_full_time weekday week_number Business
##   <chr>   <date>     <fct>  <dbl> <fct>         <ord>         <dbl>    <dbl>
## 1 1013302 2016-11-11 0         70 0             Fri              46        0
## 2 1013302 2016-11-12 0         70 0             Sat              46        0
## 3 1013302 2016-11-13 0         70 0             Sun              46        0
## 4 1013302 2016-11-14 0         70 0             Mon              46        0
## 5 1013302 2016-11-15 0         70 0             Tue              46        0
## 6 1013302 2016-11-16 0         70 0             Wed              46        0
## # ... with 11 more variables: Education <dbl>, Exercise <dbl>, Home <dbl>,
## #   Lifts <dbl>, `Non-Allocated` <dbl>, Petrol <dbl>, Shopping <dbl>, `Social /
## #   Leisure` <dbl>, Travel <dbl>, Vacation <dbl>, Work <dbl>
df_long1 <- df_wide %>%
  gather(Free_time, value2, `Shopping`, `Social / Leisure`, `Home`, `Vacation`) # wide to long; I did not know a more concise way and had to do this three times

df_long2 <- df_long1 %>%
  gather(Work_or_school, value1, Work, Business, Education)

df2 <- df_long2 %>%
  gather(Other, value3, Exercise:Travel)

head(df2) # the final result should have two columns ('activity_class' and 'time_spent'). Maybe I should rename columns or values and then convert wide to long once or twice. However, I could not figure it out.
## # A tibble: 6 x 13
##   indivID date       female   age occ_full_time weekday week_number Free_time
##   <chr>   <date>     <fct>  <dbl> <fct>         <ord>         <dbl> <chr>    
## 1 1013302 2016-11-11 0         70 0             Fri              46 Shopping 
## 2 1013302 2016-11-12 0         70 0             Sat              46 Shopping 
## 3 1013302 2016-11-13 0         70 0             Sun              46 Shopping 
## 4 1013302 2016-11-14 0         70 0             Mon              46 Shopping 
## 5 1013302 2016-11-15 0         70 0             Tue              46 Shopping 
## 6 1013302 2016-11-16 0         70 0             Wed              46 Shopping 
## # ... with 5 more variables: value2 <dbl>, Work_or_school <chr>, value1 <dbl>,
## #   Other <chr>, value3 <dbl>

1.1.2 Calculating the means by grouping variable

Calculate the mean time spent on each of the combined activity classes, grouped by weekday, participant ID, and occ_full_time.

grouped_df2 <- df2 %>%
  group_by(weekday)

grouped_df2 %>%
  summarise(Work_or_school_mean = mean(value1), Free_time_mean = mean(value2), Other_mean = mean(value3))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 4
##   weekday Work_or_school_mean Free_time_mean Other_mean
##   <ord>                 <dbl>          <dbl>      <dbl>
## 1 Mon                    87.2           253.       33.6
## 2 Tue                    98.6           246.       32.3
## 3 Wed                   106.            233.       38.6
## 4 Thu                   104.            236.       36.9
## 5 Fri                    88.4           249.       35.8
## 6 Sat                    18.9           304.       33.5
## 7 Sun                    15.0           312.       29.3
grouped_df2 <- df2 %>%
  group_by(indivID)

grouped_df2 %>%
  summarise(Work_or_school_mean = mean(value1), Free_time_mean = mean(value2), Other_mean = mean(value3))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 356 x 4
##    indivID Work_or_school_mean Free_time_mean Other_mean
##    <chr>                 <dbl>          <dbl>      <dbl>
##  1 1013302              28.1             326.     10.7  
##  2 1056237               0               359.      0.943
##  3 1103940              42.1             310.     14.9  
##  4 118068               95.5             256.     25.7  
##  5 1198262              82.9             282.     12.6  
##  6 1202035              87.9             204.     72.2  
##  7 121881                0.238           323.     29.7  
##  8 1226238              68               264.     35.8  
##  9 1292043              87.6             268.     20.6  
## 10 1326897              54.8             267.     41.6  
## # ... with 346 more rows
grouped_df2 <- df2 %>%
  group_by(occ_full_time)

grouped_df2 %>%
  summarise(Work_or_school_mean = mean(value1), Free_time_mean = mean(value2), Other_mean = mean(value3))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
##   occ_full_time Work_or_school_mean Free_time_mean Other_mean
##   <fct>                       <dbl>          <dbl>      <dbl>
## 1 0                            47.2           287.       30.2
## 2 1                            83.0           253.       35.9
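
The exercise asks for the means grouped by weekday, participant ID and occ_full_time together, which can be done with one group_by() call that lists all grouping variables. A sketch on made-up data (toy and its values are assumptions; activity_class2 is the combined class from 1.1.1):

```r
library(dplyr)

# Hypothetical long-format data: two participants, one weekday
toy <- data.frame(
  indivID         = c("a", "a", "a", "a", "b", "b"),
  weekday         = c("Mon", "Mon", "Mon", "Mon", "Mon", "Mon"),
  occ_full_time   = c(1, 1, 1, 1, 0, 0),
  activity_class2 = c("Free time", "Free time", "Work or school",
                      "Work or school", "Free time", "Work or school"),
  time_spent      = c(100, 300, 400, 500, 600, 0)
)

# One mean per combination of the grouping variables and the combined class
means <- toy %>%
  group_by(weekday, indivID, occ_full_time, activity_class2) %>%
  summarise(mean_time = mean(time_spent), .groups = "drop")

means
```

Setting .groups = "drop" returns an ungrouped result and silences the message about ungrouping that appears in the output above.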

1.2 Visualisation

Visualise the means you calculated.

If I had got the right results in 1.1, the code here should be:

# chunk options: fig.width=10, fig.height=8
df2 %>%
  ggplot(aes(activity_class2, time_spent, group = weekday, colour = weekday)) +
  geom_point() +
  facet_wrap(~activity_class2) +
  labs(x = "activity type", y = "Average time spent (minutes)", colour = "Activity type")

df2 %>%
  ggplot(aes(activity_class2, time_spent, group = indivID, colour = indivID)) +
  geom_point() +
  facet_wrap(~activity_class2) +
  labs(x = "activity type", y = "Average time spent (minutes)", colour = "Activity type")

df2 %>%
  ggplot(aes(activity_class2, time_spent, group = occ_full_time, colour = occ_full_time)) +
  geom_point() +
  facet_wrap(~activity_class2) +
  labs(x = "activity type", y = "Average time spent (minutes)", colour = "Activity type")

df2 %>%
  ggplot(aes(weekday, time_spent, group = week_number, color = activity_class)) +
  geom_line(size=1, alpha = .1) +
  geom_point(alpha = .6) +
  facet_wrap(~activity_class, scales = "free_y") +
  labs(x = "Weekday", y = "Average time spent (minutes)", color = "Activity type") +
  theme_bw() +
  theme(legend.position = "none")

For now I can only use 'Work_or_school' as an example

df2 %>%
  ggplot(aes(weekday, value1, group = week_number, color = Work_or_school)) + 
  geom_line(size=1, alpha = .1) +
  geom_point(alpha = .6) +
  facet_wrap(~Work_or_school, scales = "free_y") +
  labs(x = "Weekday", y = "Average time spent (minutes)", colour = "Activity type") + 
  theme_bw() +
  theme(legend.position = "none")

Exercise 2

2.1

What is computed in the code chunk below - what do the numbers tell you?

Can you think of another way to calculate the same thing?

df2 %>%
  distinct(indivID, date) %>%
  arrange(date) %>%
  count(date)
## # A tibble: 73 x 2
##    date           n
##    <date>     <int>
##  1 2016-10-14     6
##  2 2016-10-15    11
##  3 2016-10-16    10
##  4 2016-10-17    10
##  5 2016-10-18    14
##  6 2016-10-19    16
##  7 2016-10-20    11
##  8 2016-10-21    11
##  9 2016-10-22    17
## 10 2016-10-23    18
## # ... with 63 more rows
grouped_df2 <- df2 %>%
  group_by(date)

grouped_df2 %>%
  summarise(n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 73 x 2
##    date           n
##    <date>     <int>
##  1 2016-10-14   360
##  2 2016-10-15   660
##  3 2016-10-16   600
##  4 2016-10-17   600
##  5 2016-10-18   840
##  6 2016-10-19   960
##  7 2016-10-20   660
##  8 2016-10-21   660
##  9 2016-10-22  1020
## 10 2016-10-23  1080
## # ... with 63 more rows
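
The chunk from the exercise counts, for every date, how many participants recorded their time use on that day. A sketch of an equivalent version with n_distinct(), on made-up data (toy and its values are assumptions). It also shows why counting rows with n() gives larger numbers: in the output above every n is exactly 60 times the first result (360 = 6 × 60), which suggests that df2 holds 60 rows per participant and day.

```r
library(dplyr)

# Hypothetical data with repeated rows per participant and day
toy <- data.frame(
  indivID = c("a", "a", "b", "b", "b", "c"),
  date    = as.Date(c("2016-10-14", "2016-10-14", "2016-10-14",
                      "2016-10-15", "2016-10-15", "2016-10-15"))
)

# Version from the exercise: one row per participant-day, then rows per date
v1 <- toy %>% distinct(indivID, date) %>% count(date)

# Alternative: count the distinct participants within each date
v2 <- toy %>% group_by(date) %>% summarise(n = n_distinct(indivID))

# Counting rows instead of participants overstates the numbers
v3 <- toy %>% group_by(date) %>% summarise(n = n())
```

Here v1 and v2 agree, while v3 counts every repeated row.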

My exercise ended here. I may continue later.

The direct reason is that the ‘Run’ button disappeared when I worked on the exercises after 2.2, so I could not run any chunk. However, the main reason is that I did not keep up with the pace of the lecture at all last week. I do not know why it was so difficult and abstract to understand. By the way, my study is cross-sectional and has no time-related variable.

I would like to say something about the course.

The first one is about the approach to teaching. The lack of interaction impairs the quality of teaching.

I do not mean the timely Q&A that you already provided during the lectures. In the online UH MOOC last year (https://mooc.helsinki.fi/course/view.php?id=273&lang=en; it has been run for many rounds) or the remote video R course this semester (which friends are taking), the interactive applets, the short instructional videos (which can be paused at any time) and an active forum where students help each other gave us time to understand, digest and solve most of the problems. Considering the size and type of the course, I know some of these are unrealistic for ours. But the impact does exist.

The second one is about the assessment. The weekly grading is a little strict.

It is pass/fail. At the same time, ‘a valiant effort without full completion’ gives half of the points. The criterion is reasonable for a 2-credit course, but this one is a 5-credit, intensive course. In other 5-credit R courses, either every task is graded on a 5-point scale or the criteria are less harsh.

If only one or two words are wrong (e.g. week2ex3 and week4ex4: the means were equal between the groups because I used df$variable; the name of one variable starts with garbled characters, and I made the same choice for the others even though they could be referred to directly), in practice the code cannot work at all. But in a course this leads to 1/2 points, the same as if I had written a chunk ending with an incomplete plot or without any plot at all. It is therefore very easy to be on the edge of losing all 5 credits, as I am (19/36 points before this week).

I hesitated to write the above. Any course on R is tough and full of errors. I am not sure how many students have a similar confusion. Moreover, we are already doctoral students and do not need to value credits too much. I just do not want to give up without a struggle (and my field is teaching and learning in higher education): since the aim of attending courses is to learn something, should a student stop when he or she has heard something but was unable to master it? It feels as if the course and teachers abandon the participants without communication as soon as they fall behind.

  • Thank you for the explanation below the grade. I made some corrections to previous exercises (visible only on the course diary page: https://yufanyin.github.io/datavis-R/; re-uploading to Moodle after so many days would not be proper).

Either for acquiring skills or credits, I hope I can continue attending this course.

2.2

Plot the numbers from above (use points, lines, or whatever you think is suitable).

df2 %>%
  ggplot(aes(date, indivID)) +
  geom_point()

Exercise 3

3.1

Count the total number of participants in the data.

3.2

For each participant, count the number of separate days that they recorded their time use on.
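
Both counts can be sketched with n_distinct() and count(). The example uses made-up data (toy, its values and the column name n_days are assumptions). For the real data, the grouped summary in 1.1.2 has 356 rows, one per indivID, which suggests 356 participants.

```r
library(dplyr)

# Hypothetical data with repeated rows per participant and day
toy <- data.frame(
  indivID = c("a", "a", "b", "b", "b", "c"),
  date    = as.Date(c("2016-10-14", "2016-10-14", "2016-10-14",
                      "2016-10-15", "2016-10-15", "2016-10-15"))
)

# 3.1: total number of participants
n_participants <- n_distinct(toy$indivID)

# 3.2: number of separate recording days per participant
days_per_participant <- toy %>%
  distinct(indivID, date) %>%   # one row per participant-day
  count(indivID, name = "n_days")

n_participants
days_per_participant
```

The distinct() step matters because the data hold several rows (one per activity class) for each participant and day.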

Exercise 4

Explain step by step what happens in the code chunk below, and what the final figure represents.

df2 %>%
  group_by(indivID) %>%
  mutate(start_date = min(date)) %>%
  ungroup %>%
  mutate(indivID = factor(indivID),
         indivID = fct_reorder(indivID, start_date) %>% fct_rev()) %>%
  ggplot(aes(date, indivID, colour = month(start_date, label = T))) + 
  geom_line() + 
  geom_point(size=.5, alpha=.1) +
  theme_bw() + 
  scale_y_discrete(breaks = "none") +
  labs(x = "Date", y = "", colour = "Starting month")

Week 6 Exercises

yufan_yin_week6: 20.10. - 27.10.2020

Also see in the page to my course diary: https://yufanyin.github.io/datavis-R/

Exercise 1

The data frames df_w and df_f represent repeated measures data from 60 participants. Variables F1-F3 and W1-W3 are “sub-variables” that will be used to make two composite variables F_total and W_total, respectively.

1.1

Merge the two data frames together.

df_f <- df_f  %>% 
  mutate(session = as.factor(session)) # many errors occurred when I tried to change the type of 'session' and 'group'. Q: I still did not understand why only factor works.

df <- full_join(df_f, df_w, by = c("id" = "id", "session" = "session", "group" = "group"), suffix = c("_f", "_w")) 

head(df)
##   id session group F1 F2 F3 W1 W2 W3
## 1  1       2     1  0  0  3  1  0  0
## 2  1       1     1  3  0  0  2  3  2
## 3  2       2     1  2  0  2  3  2  0
## 4  2       1     1  0  0  0  0  2  1
## 5  3       2     1  1  0  0  0  3  3
## 6  3       1     1  0  0  0  3  2  0

1.2

Using the merged data frame, create the composite variables F_total and W_total, which are the sums of F1-F3 and W1-W3, respectively (i.e. their values can range from 0 to 9).

df$F_total <- rowSums(df[, c('F1', 'F2', 'F3')])

df$W_total <- rowSums(df[, c('W1', 'W2', 'W3')]) 

# I searched all the material and did not find that row or column sums were taught. Why could it be an exercise?

head(df)
##   id session group F1 F2 F3 W1 W2 W3 F_total W_total
## 1  1       2     1  0  0  3  1  0  0       3       1
## 2  1       1     1  3  0  0  2  3  2       3       7
## 3  2       2     1  2  0  2  3  2  0       4       5
## 4  2       1     1  0  0  0  0  2  1       0       3
## 5  3       2     1  1  0  0  0  3  3       1       6
## 6  3       1     1  0  0  0  3  2  0       0       5

Exercise 2

2.1

Visualise the distributions of F_total and W_total for the two groups and measurement sessions (for example as boxplots).

df %>%
  ggplot(aes(session, F_total)) + 
  geom_boxplot() +
  facet_wrap(~group)

df %>%
  ggplot(aes(session, W_total)) + 
  geom_boxplot() +
  facet_wrap(~group)

# try more
df %>%
  ggplot(aes(session, F_total)) + 
  geom_violin() +
  geom_dotplot(binaxis = "y", stackdir = "center", alpha = .3, binwidth = .1) +
  facet_wrap(~group)

# Q: Is binwidth set without a specific standard ('exploring multiple widths to find the best to illustrate the stories in your data')? I found the dots were too big when binwidth = 1 (according to data_wrangling_and_plotting_week3).

df %>%
  ggplot(aes(session, W_total)) + 
  geom_violin() +
  geom_jitter(alpha = .3) +
  facet_wrap(~group)

# Q: Is this the original distribution, without any calculation or rotation? If so, I prefer this plot to the next one.

# ['The jitter geom is a convenient shortcut for geom_point(position = "jitter"). It adds a small amount of random variation to the location of each point, and is a useful way of handling overplotting caused by discreteness in smaller datasets.']

df %>%
  ggplot(aes(session, W_total)) + 
  geom_violin() +
  geom_dotplot(binaxis = "y", stackdir = "center", alpha = .3, binwidth = .1) +
  facet_wrap(~group)

# ['stackdir: which direction to stack the dots. "up" (default), "down", "center", "centerwhole" (centered, but with dots aligned)']

2.2

Fit a linear regression model with F_total as the DV, and session and group as predictors.

# where is 'DV'? (DV = dependent variable, here F_total)
# LM with an interaction effect
F_total.model.1 <- lm(F_total ~ session * group, data = df)

summary(F_total.model.1)
## 
## Call:
## lm(formula = F_total ~ session * group, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3000 -0.8667 -0.3000  0.7000  3.7000 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.4000     0.2740   5.109 1.28e-06 ***
## session2          0.9000     0.3875   2.323   0.0219 *  
## group2            1.9000     0.3875   4.903 3.10e-06 ***
## session2:group2  -1.3333     0.5480  -2.433   0.0165 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.501 on 116 degrees of freedom
## Multiple R-squared:  0.1883, Adjusted R-squared:  0.1673 
## F-statistic: 8.969 on 3 and 116 DF,  p-value: 2.163e-05
F_total_coef <- broom::tidy(F_total.model.1) %>%
  select(term, estimate) %>%
  mutate(estimate = round(estimate, 2)) %>% # round decimals for plot text
  spread(term, estimate) %>%
  rename(Intercept = `(Intercept)`,
         group_coef = group2, # session and group are factors, so the terms are named session2, group2 and session2:group2
         session_coef = session2,
         session_group_coef = `session2:group2`)

# My first version renamed `group`, `session` and `session:group` and failed with:
# error: Can't rename columns that don't exist. x Column `group` doesn't exist.
# I had to delete {r} to retain the code.

(F_total_plot <- broom::augment(F_total.model.1, se_fit = TRUE) %>%
  ggplot(aes(session, F_total)) +
  geom_point(aes(color = group), alpha = .7) + 
  geom_line(aes(session, .fitted, color = group, group = group), size = 1) + # group = group joins the fitted values across the two sessions
  geom_ribbon(aes(ymin=.fitted-1.96*.se.fit, ymax=.fitted+1.96*.se.fit, fill = group, group = group), alpha=0.2) +
  theme_bw())

# plot annotations
F_total_plot +
  geom_point(aes(0, F_total_coef$Intercept)) + # mark the intercept point
  geom_text(aes(0.35, F_total_coef$Intercept, 
                label = paste("Intercept =", F_total_coef$Intercept)), vjust=-.9) +
  geom_text(aes(4.2, F_total_coef$Intercept + F_total_coef$session_coef * 4.2, # annotate session coefficient 
                label = paste("Slope =", F_total_coef$session_coef)),
            vjust = -.9) +
  geom_segment(aes(x = 1.3, y = F_total_coef$Intercept + F_total_coef$session_coef * 1.3, # draw arrow to mark group coefficient
                   xend = 1.3, yend = F_total_coef$Intercept + F_total_coef$session_coef * 1.3 + F_total_coef$group_coef * 1), 
               arrow = arrow()) + 
  geom_text(aes(1.3, F_total_coef$Intercept + F_total_coef$session_coef * 1.3 + F_total_coef$group_coef * 1, 
                label = paste("Group coef =", F_total_coef$group_coef)),
            vjust = 2, hjust = 1.1) +
   geom_segment(aes(x = 1.3, y = F_total_coef$Intercept + F_total_coef$session_coef * 1.3 + F_total_coef$group_coef * 1,
                    xend = 1.3, yend = F_total_coef$Intercept + F_total_coef$session_coef * 1.3 + F_total_coef$group_coef * 1 +  F_total_coef$session_group_coef * 1.3), 
               arrow = arrow()) +
   geom_text(aes(1.3, F_total_coef$Intercept + F_total_coef$session_coef * 1.3 + F_total_coef$group_coef * 1 + F_total_coef$session_group_coef * 1.3, 
                label = paste("Interaction coef =", F_total_coef$session_group_coef)),
            vjust = 2, hjust = 1.1) 

2.3

Look at the means of F_total by group and session. How are they linked to the linear regression model coefficients?

grouped_F_total1 <- df %>%
  group_by(group)

grouped_F_total1 %>%
  summarise(F_total1 = mean(F_total))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   group F_total1
##   <fct>    <dbl>
## 1 1         1.85
## 2 2         3.08
grouped_F_total2 <- df %>%
  group_by(session)

grouped_F_total2 %>%
  summarise(F_total2 = mean(F_total))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   session F_total2
##   <fct>      <dbl>
## 1 1           2.35
## 2 2           2.58
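
One way to see the link between the means and the coefficients, sketched on hypothetical toy data (the data frame `toy` and its `score` column are invented for illustration): when both predictors are factors with R's default treatment coding, the coefficients of `score ~ session * group` are differences between cell means. If `session` enters the model as a number instead, the coefficients describe fitted lines rather than exact cell means.

```r
# Hypothetical toy data: 2 groups x 2 sessions, two observations per cell
toy <- data.frame(
  group   = factor(rep(c("1", "2"), each = 4)),
  session = factor(rep(rep(c("1", "2"), each = 2), times = 2)),
  score   = c(1, 2, 2, 3, 3, 4, 5, 6)
)

# Mean of every group-by-session cell
cell_means <- aggregate(score ~ group + session, data = toy, FUN = mean)
cell_means

# Regression with interaction; reference cell is session 1, group 1
fit <- lm(score ~ session * group, data = toy)
coef(fit)
# (Intercept)     = mean of the reference cell (session 1, group 1)
# session2        = session 2 minus session 1, within group 1
# group2          = group 2 minus group 1, within session 1
# session2:group2 = extra session change in group 2 compared with group 1
```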

Exercise 3

Visualise the anscombe dataset using ggplot2.

anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
anscombe$x_total <- rowSums(anscombe[, c('x1', 'x2', 'x3', 'x4')])

anscombe$y_total <- rowSums(anscombe[, c('y1', 'y2', 'y3', 'y4')]) 

head(anscombe)
##   x1 x2 x3 x4   y1   y2    y3   y4 x_total y_total
## 1 10 10 10  8 8.04 9.14  7.46 6.58      38   31.22
## 2  8  8  8  8 6.95 8.14  6.77 5.76      32   27.62
## 3 13 13 13  8 7.58 8.74 12.74 7.71      47   36.77
## 4  9  9  9  8 8.81 8.77  7.11 8.84      35   33.53
## 5 11 11 11  8 8.33 9.26  7.81 8.47      41   33.87
## 6 14 14 14  8 9.96 8.10  8.84 7.04      50   33.94
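
The exercise asks for a ggplot2 visualisation, and the row totals above do not show the point of the dataset on their own. A possible sketch, assuming the four x/y pairs are meant to be compared side by side: stack the pairs into a long table (the names `anscombe_long` and `set` are my own) and facet by set.

```r
library(ggplot2)

# Stack the four (x, y) pairs into one long data frame
anscombe_long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    set = paste("Set", i),
    x   = anscombe[[paste0("x", i)]],
    y   = anscombe[[paste0("y", i)]]
  )
}))

# One panel per set, each with its own least-squares line
p_anscombe <- ggplot(anscombe_long, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~set) +
  theme_bw()
p_anscombe
```

The fitted lines come out nearly identical in all four panels although the point patterns differ, which is the usual reason for plotting this dataset.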

Final Assignment

Prepare a separate R Notebook/Markdown document, which will be the first draft of your final assignment with your own data. In the draft, include the following:

  1. Outline the study design, your research question, DV(s), IV(s)
  2. Data wrangling: start from reading in the raw data and show all steps
  3. Plot your main result(s)

Even if you had already completed some of these steps before, please include all of them in your document. NOTE: Return either a readable HTML document (.html or .nb.html), or an .Rmd file along with your data, to make it possible for us to review your work! Make the document as professional-looking as possible (you can, of course, include your comments/questions in the draft). You will get feedback on the draft, based on which you can then make the final version. The final document should be a comprehensive report of your data wrangling process and results.

7.1 Reading the data

Structure of the data

learning2019 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019_w4.csv", stringsAsFactors = TRUE) 
learning19 <- learning2019[1:13]
str(learning19)
## 'data.frame':    211 obs. of  13 variables:
##  $ 锘縞luster     : int  3 2 1 1 3 1 2 2 1 3 ...
##  $ unref          : num  4 2 3 2 3 ...
##  $ deep           : num  3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
##  $ orga           : num  3.33 3 4.33 3.67 2.67 ...
##  $ blocks         : num  3.33 3.67 3.67 3 3.67 ...
##  $ procrastination: num  3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
##  $ perfectionism  : num  3.67 3.33 3.33 2.67 2.33 ...
##  $ innateability  : num  1 1.5 3 1.5 2.5 2 2 1 2.5 1 ...
##  $ ktransforming  : num  4 3.67 3.67 3.33 4 ...
##  $ productivity   : num  1.25 2 1.25 2.25 2.25 2.5 3 2.25 2.25 3.75 ...
##  $ gender         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ studentstatus  : int  1 1 1 1 1 1 1 1 1 3 ...
##  $ studylength    : int  39 51 3 3 15 3 3 3 3 3 ...
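
The odd prefix in the first column name (`锘縞luster`) looks like a UTF-8 byte order mark (BOM) at the start of the CSV being decoded with a different encoding. A minimal sketch of one possible fix, assuming the file really is UTF-8 with a BOM (the temporary file and its two columns are made up for the demonstration):

```r
# Write a small CSV that starts with the UTF-8 BOM bytes EF BB BF
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("cluster,unref\n3,4\n2,2\n")), con)
close(con)

# "UTF-8-BOM" tells R to drop the BOM while reading
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)
```

An alternative that does not touch the import is to rename the column afterwards, for example `names(learning19)[1] <- "cluster"`. Either way, the later code would then have to refer to `cluster` instead of the garbled name.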
**The aim of the study** is to investigate the interrelationships between the approaches to learning and conceptions of academic writing among international university students. The research questions are as follows:

1) What kinds of conceptions of academic writing (blocks, procrastination, perfectionism, innate ability, knowledge transforming and productivity) do the international students have?

2) What kinds of approaches to learning (deep approach to learning, unreflective studying and organised studying) do the international students apply in their studies, and what learning profiles can be identified?

3) What are the differences in conceptions of academic writing between the learning profiles?

Altogether 218 international students of the university participated in the study in 2018 and 2019. Students were divided into homogeneous groups based on their Z scores on the three approaches to learning. Then we compare mean differences and ANOVA results between the profiles.
The data were collected with the Writing Process Questionnaire (Lonka, 2003; Lonka et al., 2014) and the HowULearn Questionnaire (Parpala & Lindblom-Ylänne, 2012; Hailikari & Parpala, 2014). Both use a 5-point Likert scale and have been validated in Finnish and other contexts.

**The data 'learning2019'** originally consists of 218 observations and 17 variables; the version read in above has 211 observations, and `learning19` keeps its first 13 variables. It contains the students' scores on approaches to learning (different ways that students process information: unreflective studying, deep approach to learning and organised studying), conceptions of academic writing (blocks, procrastination, perfectionism, innate ability, knowledge transforming and productivity), and some background information (categorical variables, e.g. gender, age, faculty, student status and study length).

**The explanation of some columns** is as follows. Each score is the average of 2-4 questions on a 5-point Likert scale (1 = totally disagree, 5 = fully agree), so summing A1, A2 and A3 or AW1 - AW6 would not be meaningful.

- "cluster": the cluster membership from the k-means cluster analysis: 1) Reflective and organised students (N=97, 45.1%), 2) Reflective and unorganised students (N=78, 36.3%), and 3) Unreflective and unorganised students (N=40, 18.6%)

- "unref": relying on memorisation in the learning process, lacking the reflective approach to studying and applying the fragmented knowledge base.

- "deep": comprehending the intentional content, using evidence and integrating with previous knowledge.

- "orga": time management, study organisation, effort management and concentration.

- "blocks": the inability to write productively whose reason is not intellectual capacity or literary skills.

- "procrastination": failing to start or postponing tasks like preparing for exams and doing homework.

- "perfectionism": setting overly high standards, pursuing flawlessness, and evaluating one’s behavior critically.

- "innateability": writing is a skill which "is determined at birth" or "cannot be taught or developed".

- "ktransforming": (knowledge transforming) using writing for developing knowledge and generating new ideas and in the reflective and dialectic processes.

- "productivity": (sense of productivity) part of self-efficacy in writing.
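
The profile step described above (Z scores on the three approaches, then k-means) can be sketched as follows. The simulated data frame `sim`, the seed and `nstart = 25` are assumptions for illustration only; the settings of the original analysis are not given here.

```r
set.seed(2019)

# Simulated stand-in for the three approach scores (1-5 scale)
sim <- data.frame(
  unref = runif(60, 1, 5),
  deep  = runif(60, 1, 5),
  orga  = runif(60, 1, 5)
)

# Standardise each column to mean 0 and sd 1 (Z scores)
z <- scale(sim)

# k-means with three clusters; several random starts for stability
km <- kmeans(z, centers = 3, nstart = 25)
sim$cluster <- factor(km$cluster)

# Raw-scale means per cluster, used to describe each profile
aggregate(cbind(unref, deep, orga) ~ cluster, data = sim, FUN = mean)
```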

7.2 Relationships between the variables: Correlation matrix

Calculate and print the correlation matrix

cor_matrix<-cor(learning19[2:10]) %>% round(digits = 2)
cor_matrix
##                 unref  deep  orga blocks procrastination perfectionism
## unref            1.00 -0.48 -0.31   0.33            0.25          0.28
## deep            -0.48  1.00  0.32  -0.27           -0.18         -0.19
## orga            -0.31  0.32  1.00  -0.22           -0.38         -0.14
## blocks           0.33 -0.27 -0.22   1.00            0.55          0.54
## procrastination  0.25 -0.18 -0.38   0.55            1.00          0.35
## perfectionism    0.28 -0.19 -0.14   0.54            0.35          1.00
## innateability    0.16 -0.11 -0.02   0.24            0.13          0.28
## ktransforming   -0.16  0.31  0.16  -0.30           -0.21         -0.25
## productivity    -0.15  0.16  0.30  -0.38           -0.46         -0.22
##                 innateability ktransforming productivity
## unref                    0.16         -0.16        -0.15
## deep                    -0.11          0.31         0.16
## orga                    -0.02          0.16         0.30
## blocks                   0.24         -0.30        -0.38
## procrastination          0.13         -0.21        -0.46
## perfectionism            0.28         -0.25        -0.22
## innateability            1.00         -0.25         0.01
## ktransforming           -0.25          1.00         0.21
## productivity             0.01          0.21         1.00

Visualise the correlation matrix and mark the correlations according to their significance level

library(corrplot)
p.mat <- cor.mtest(learning19[2:10])$p # p-values must come from the raw data, not from the correlation matrix
corrplot(cor_matrix, method="circle", type="upper",  tl.cex = 0.6, p.mat = p.mat, sig.level = 0.01, title="Correlations of learning19", mar=c(0,0,1,0))
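
The p-value matrix passed to `corrplot()` is meant to hold one test per pair of variables, computed from the raw scores. A base R sketch of that idea with `cor.test()`, using built-in `mtcars` columns as a stand-in for the learning data (the names `vars` and `p_mat` are my own):

```r
# Stand-in data: three numeric columns from a built-in dataset
vars <- mtcars[, c("mpg", "wt", "hp")]
k <- ncol(vars)

# Square matrix of p-values; the diagonal is left at 0
p_mat <- matrix(0, k, k, dimnames = list(names(vars), names(vars)))

for (i in seq_len(k - 1)) {
  for (j in (i + 1):k) {
    # Pearson correlation test on the raw observations of columns i and j
    p <- cor.test(vars[[i]], vars[[j]])$p.value
    p_mat[i, j] <- p
    p_mat[j, i] <- p
  }
}
round(p_mat, 4)
```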

7.3 Learning profiles (based on the combination of approaches to learning)

7.3.1 Means of approaches to learning

learning19$锘縞luster <- factor(learning19$锘縞luster, levels = c("1", "2" , "3"),
                 labels = c("Reflective and organised students", "Reflective and unorganised students", "Unreflective and unorganised students"))

learning19$gender <- factor(learning19$gender, levels = c("1", "2"),
                 labels = c("Male", "Female"))
grouped_df1 <- learning19 %>%
  group_by(锘縞luster)

grouped_df1 %>%
  summarise(unref_mean = mean(unref), deep_mean = mean(deep), orga_mean = mean(orga))
## # A tibble: 3 x 4
##   锘縞luster                            unref_mean deep_mean orga_mean
##   <fct>                                      <dbl>     <dbl>     <dbl>
## 1 Reflective and organised students           1.88      4.21      4.23
## 2 Reflective and unorganised students         2.00      4.15      2.74
## 3 Unreflective and unorganised students       3.29      3.20      2.66

7.3.2 Boxplot: means of approaches to learning

learning19 %>%
  ggplot(aes(锘縞luster, unref)) + 
  geom_boxplot() +
  facet_wrap(~gender) + 
  labs(x = "Cluster", y = "Unreflective studying", title = "Means of unreflective studying") + 
  theme(axis.text.x = element_text(angle = 15))

learning19 %>%
  ggplot(aes(锘縞luster, deep)) + 
  geom_boxplot() +
  facet_wrap(~gender) + 
  labs(x = "Cluster", y = "Deep approach to learning", title = "Means of deep approach to learning") + 
  theme(axis.text.x = element_text(angle = 15))

learning19 %>%
  ggplot(aes(锘縞luster, orga)) + 
  geom_boxplot() +
  facet_wrap(~gender) + 
  labs(x = "Cluster", y = "Organised studying", title = "Means of organised studying") + 
  theme(axis.text.x = element_text(angle = 15))

# The descriptive cluster labels were too long to display. I tried some code I found online (rotating the axis text), and it finally worked.
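
Besides rotating the axis text, long factor labels can be wrapped onto several lines before plotting. A small base R sketch (the helper `wrap_levels` and the width of 20 characters are illustrative choices, not something from the course material):

```r
# Insert line breaks into factor levels so that each line stays short
wrap_levels <- function(f, width = 20) {
  levels(f) <- vapply(
    levels(f),
    function(l) paste(strwrap(l, width = width), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
  f
}

profile <- factor(c("Reflective and organised students",
                    "Unreflective and unorganised students"))
wrapped <- wrap_levels(profile)
levels(wrapped)
```

A factor wrapped this way can be used on the x-axis as before; ggplot2 draws each label on as many lines as it has line breaks.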

7.4 Differences in conceptions of academic writing between the learning profiles

7.4.1 Means of conceptions of academic writing

grouped_df2 <- learning19 %>%
  group_by(锘縞luster)

grouped_df2 %>%
  summarise(blocks_mean = mean(blocks), proc_mean = mean(procrastination), perf_mean = mean(perfectionism), inab_mean = mean(innateability), ktrans_mean = mean(ktransforming), produ_mean = mean(productivity))
## # A tibble: 3 x 7
##   锘縞luster    blocks_mean proc_mean perf_mean inab_mean ktrans_mean produ_mean
##   <fct>               <dbl>     <dbl>     <dbl>     <dbl>       <dbl>      <dbl>
## 1 Reflective a~        2.47      2.85      2.44      1.75        4.12       2.72
## 2 Reflective a~        2.66      3.45      2.50      1.64        4.06       2.36
## 3 Unreflective~        3.15      3.70      2.97      2.05        3.80       2.16

7.4.2 Scatter plots: orga vs procrastination; orga vs productivity

learning19 %>%
  ggplot(aes(orga, procrastination, color = 锘縞luster)) + # x = orga, y = procrastination
  geom_jitter(alpha = .5) +
  labs(x = "Organised studying", y = "Procrastination", title = "Organised studying - procrastination in different learning profiles") + 
  geom_point()

learning19 %>%
  ggplot(aes(orga, productivity, color = 锘縞luster)) + # x = orga, y = productivity
  geom_jitter(alpha = .5) +
  labs(x = "Organised studying", y = "Productivity", title = "Organised studying - productivity in different learning profiles") + 
  geom_point()

7.4.3 Bar plots: cluster, blocks

learning19 %>% 
  ggplot(aes(锘縞luster, blocks, fill = gender)) + 
  geom_col(position = "dodge") + 
  labs(x = "Cluster", y = "Blocks", title = "Blocks in different gender and learning profiles") + 
  theme(axis.text.x = element_text(angle = 15)) # geom_smooth(method = "lm") removed: a regression line cannot be fitted over a categorical x-axis

7.4.4 Bar plots: deep, perfectionism

grouped_df3 <- learning19 %>%
  group_by(deep)

grouped_df3 %>%
  summarise(perfectionism_mean2 = mean(perfectionism))
## # A tibble: 14 x 2
##     deep perfectionism_mean2
##    <dbl>               <dbl>
##  1  1                   2.33
##  2  2                   3.33
##  3  2.25                4.5 
##  4  2.5                 2   
##  5  2.75                3.33
##  6  3                   2.67
##  7  3.25                2.64
##  8  3.5                 2.64
##  9  3.75                2.64
## 10  4                   2.61
## 11  4.25                2.53
## 12  4.5                 2.43
## 13  4.75                2.41
## 14  5                   2.03
learning19 %>% 
  ggplot(aes(deep, perfectionism, fill = 锘縞luster)) + 
  geom_col(position = "dodge", alpha = .5) + 
  labs(x = "Deep approach to learning", y = "Perfectionism", title = "Deep approach to learning and perfectionism in different learning profiles")

# The values on the y-axis should be the means of perfectionism at each level of the deep approach. However, I was unable to use the calculated result above or stat_summary().
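
One way to get bars whose height is a group mean is to summarise first and then hand the summary table to `geom_col()`. A sketch with the built-in `mtcars` data standing in for `learning19` (`cyl` plays the role of the grouping variable and `mpg` the averaged score; neither is from this study):

```r
library(ggplot2)

# Step 1: one row per group with the mean of the outcome
means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Step 2: plot the summary table, so each bar is exactly one mean
p_means <- ggplot(means, aes(factor(cyl), mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean mpg")
p_means
```

The same result should be available in one step with `stat_summary(fun = mean, geom = "col")` on the raw data; the `fun` argument assumes ggplot2 3.3.0 or later.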

7.4.5 One-way ANOVA test: blocks

One of the findings is that students applying a deep approach to learning experienced fewer blocks and less perfectionism, and tended not to regard academic writing as an innate ability.

Take the differences in blocks between the learning profiles as an example:

aov_blocks <- aov(blocks ~ 锘縞luster, data = learning19)
summary(aov_blocks)
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## 锘縞luster    2  12.35   6.173   8.048 0.00043 ***
## Residuals   208 159.54   0.767                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7.4.6 Simple regression with lm(): blocks ~ deep

lm_blocks_deep <- lm(blocks~deep, data = learning19)
summary(lm_blocks_deep)
## 
## Call:
## lm(formula = blocks ~ deep, data = learning19)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.96115 -0.56593  0.00193  0.56979  2.19954 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.24560    0.39391  10.778  < 2e-16 ***
## deep        -0.39522    0.09715  -4.068 6.72e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.873 on 209 degrees of freedom
## Multiple R-squared:  0.07337,    Adjusted R-squared:  0.06894 
## F-statistic: 16.55 on 1 and 209 DF,  p-value: 6.719e-05

There is a statistically significant negative relationship between the deep approach to learning and blocks (estimate = -0.40, p = 6.72e-05). Next, draw a scatter plot with a fitted regression line and print the model coefficients.

qplot(deep, blocks, data = learning19) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

model_blocks_deep <- lm(blocks ~ deep, data = learning19)
model_blocks_deep
## 
## Call:
## lm(formula = blocks ~ deep, data = learning19)
## 
## Coefficients:
## (Intercept)         deep  
##      4.2456      -0.3952

7.4.7 Multiple regression: blocks ~ deep + orga

This multiple regression is to test whether organised studying has an influence on blocks.

model_blocks_deep_orga <- lm(blocks ~ deep + orga, data = learning19)
summary(model_blocks_deep_orga)
## 
## Call:
## lm(formula = blocks ~ deep + orga, data = learning19)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92137 -0.63187  0.03831  0.57286  2.29678 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.45936    0.40387  11.042  < 2e-16 ***
## deep        -0.32756    0.10164  -3.223  0.00147 ** 
## orga        -0.14203    0.06781  -2.094  0.03743 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.866 on 208 degrees of freedom
## Multiple R-squared:  0.09251,    Adjusted R-squared:  0.08378 
## F-statistic:  10.6 on 2 and 208 DF,  p-value: 4.126e-05

The p-value for orga (0.037) shows that its association with blocks is statistically significant at the 0.05 level. The multiple R-squared (0.09) is slightly higher than in the blocks ~ deep model (0.07), which means that the model explains a little more of the variance in blocks once organised studying is taken into account.
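
Whether adding a predictor improves the model can also be checked with an F-test on the two nested models, alongside the change in R-squared. A sketch on the built-in `mtcars` data (a stand-in only; the variables are not from this study):

```r
# Smaller model and the same model with one extra predictor
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared can only stay the same or rise when a predictor is added
r2 <- c(simple = summary(m1)$r.squared, multiple = summary(m2)$r.squared)
r2

# F-test: does the extra predictor reduce the residual sum of squares
# by more than chance would suggest?
cmp <- anova(m1, m2)
cmp
```

With the models in this section the call would be `anova(lm_blocks_deep, model_blocks_deep_orga)`.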

I previously modelled approaches to learning and other variables using data collected with the same questionnaire. The multiple R-squared there was 0.20, so I am not sure whether 0.09 is too low.

Thank you for your kind feedback these weeks.